Multicore

- Most significant development since consumer 3D
- Explicit parallelism
- Hardware problem becoming software problem will require new techniques

- The decisions faced with multiple cores
- How we are approaching multiple cores
- Algorithms and paradigms

Goals

- Integrate multicore across Valve’s business
- Expose to game programmers, licensees and MOD authors
- Scale to cores without recompile
- Create value beyond frame rate
- Apply cores to new gameplay

Challenges

- Games want maximal CPU utilization
- Games are inherently serial
- Decades of experience in single threaded optimization
- Millions of lines of code written for single threading

- Threading model
- Threading framework

VALVE
© 2007 Valve Corporation. All Rights Reserved.
WWW.GDCONF.COM

Threading Models

- Fine grained threading
- Coarse threading
- Hybrid threading

Diving In

- Client
  - User input
  - Rendering
  - Graphics simulation
- Server
  - AI
  - Physics
  - Game logic

Diving In

- Experiment: run client and server each on own core
- Benefits: forced to confront systems that are not thread safe or not thread efficient

Discoveries

- Problem: shared data access
  - Global data
  - Static data (optimizations/function local state)
  - Singleton objects



Discoveries

- Problem: shared data access
- Bad thread safety is easy!
- Slap on a mutex/critical section
- The simple thing is the worst thing
- Mutexes are terrible
  - Excessive waits
  - Error prone
  - Fail to scale
- Establish slow but stable baseline

Discoveries

- Efficient thread safety
- No synchronization (“wait-free”)
- Each thread has a private copy of all the data needed to perform operation:
  - Threads working on independent problems
  - Replace globals with thread private data
  - Reorient to pipeline
- Example: Source “Spatial Partition”
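
The "replace globals with thread private data" idea can be sketched in standard C++ roughly as follows. This is a minimal illustration, not Source engine code: `t_checkCount`, `CountVisits`, and `RunOnThreads` are hypothetical names, and `thread_local` stands in for whatever per-thread storage the engine actually uses.

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical scratch counter. As a plain global it would be shared by every
// caller; as thread_local, each thread gets its own private copy.
thread_local int t_checkCount = 0;

// Touches only the calling thread's copy, so no synchronization is needed.
int CountVisits( int nItems )
{
    assert( nItems >= 0 );
    t_checkCount = 0;
    for ( int i = 0; i < nItems; ++i )
        ++t_checkCount;
    return t_checkCount;
}

// Runs CountVisits on several threads at once and collects each result.
// Every thread writes to its own slot of 'results', so the slots are not shared.
std::vector<int> RunOnThreads( int nThreads, int nItems )
{
    std::vector<int> results( nThreads, 0 );
    std::vector<std::thread> threads;
    for ( int t = 0; t < nThreads; ++t )
        threads.emplace_back( [&results, t, nItems] { results[t] = CountVisits( nItems ); } );
    for ( std::thread &th : threads )
        th.join();
    return results;
}
```

Because no thread ever reads another thread's counter, the threads never wait on each other, which is the property the slide calls "wait-free".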

Discoveries

- Efficient thread safety
- No synchronization (“wait-free”)
- Better synchronization tools, techniques
- Analyze data access
- Example: symbol table using read/write lock
- Decouple using queued function calls
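
A symbol table guarded by a read/write lock might look something like the sketch below. It assumes C++17 (`std::shared_mutex`); the class name `SymbolTable` and its `AddOrFind` method are invented for illustration and are not the Source API. The point is that lookups, which dominate, take a shared lock and do not block each other.

```cpp
#include <cassert>
#include <mutex>
#include <shared_mutex>
#include <string>
#include <unordered_map>

// Hypothetical symbol table: maps a name to a small integer id.
class SymbolTable
{
public:
    // Returns the id for 'name', adding the name if it is not present yet.
    int AddOrFind( const std::string &name )
    {
        {
            // Common case: many readers can hold the shared lock at once.
            std::shared_lock<std::shared_mutex> readLock( m_lock );
            auto it = m_ids.find( name );
            if ( it != m_ids.end() )
                return it->second;
        }
        // Rare case: take the exclusive lock to insert. emplace() leaves an
        // existing entry alone, which covers another thread inserting the
        // same name between the two locks.
        std::unique_lock<std::shared_mutex> writeLock( m_lock );
        auto result = m_ids.emplace( name, static_cast<int>( m_ids.size() ) );
        return result.first->second;
    }

private:
    std::shared_mutex m_lock;
    std::unordered_map<std::string, int> m_ids;
};
```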

Discoveries

- What if you can’t eliminate contention over shared resources?

Results

[Figure: timeline of core activity with segments labeled CLIENT, SERVER, and IDLE]

- Can approach 2x in contrived maps
- More like 1.2x in real single player
- Applicable to 360 Team Fortress 2

Hybrid threading

- Use the appropriate tool for the job
- Some systems on cores (e.g. sound)
- Some systems split internally in a coarse manner
- Split expensive iterations across cores fine grained
- Queue some work to run when a core goes idle
- Need strong tools
- Maximal core utilization

Hybrid threading: Rendering

[Figure: rendering flow diagram. Box labels: Main View, Skybox, Monitor, Etc.; Scene List; Particles (Sim & Draw); Character (Bone Setup, Draw); Etc.]

Problems

- Per-view scene construction limits opportunity
- Arbitrary object type order
- Arbitrary code execution
- Simulation and Rendering interleaved
- Lazy calculation optimizations

- Iterative Transition: Skeletal Animation
- Parallelize lazy calculation triggers
- Refactor bone setup into single pass per view
- Refactor into single pass for all views
- Same pattern for other CPU-intensive stages

Revised pipeline

- Construct scene rendering lists for multiple scenes in parallel (e.g., the world and its reflection in water)
- Overlap graphics simulation
- Compute character bone transformations for all characters in all scenes in parallel
- Allow multiple threads to draw in parallel
- Serialize drawing operations on another core

Threading Tools

- Implementing Hybrid Threading
- Programmers solve game development problems, not threading problems
- Empower all programmers to leverage cores
- Operating system: too low level
- Compiler extensions (OpenMP): too opaque
- Tailored tools: correct abstraction

Tailored tools: Game Threading Infrastructure

- Custom work management system
- Intuitive for programmers
- Focus on keeping cores busy
- Thread pool: N-1 threads for N cores
- Support hybrid threading
- Function threading
- Array parallelism
- Queued and immediate execution

Tailored tools: Game Threading Infrastructure

- Goal: make system easy to use, hard to mess up
- Example: compiler generated functors
- Uses templates to package up functions and data, point of call looks very similar
- Call arrives on other end as if called normally
- Saves time, reduces error, encourages experimentation
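
One way to picture "templates package up functions and data" is the C++17 sketch below. It is an assumption-laden illustration rather than Valve's implementation: `PackageCall`, `RunServerFrame`, and `g_lastTicks` are made-up names, and `std::function` plus `std::apply` stand in for the engine's own functor classes.

```cpp
#include <cassert>
#include <functional>
#include <tuple>
#include <utility>

// Packages a callable together with copies of its arguments into one object.
// The object can be queued, handed to another thread, and invoked later.
template <typename Fn, typename... Args>
std::function<void()> PackageCall( Fn fn, Args... args )
{
    return [fn, packed = std::make_tuple( args... )]() mutable {
        // On the receiving end the call arrives as if made normally.
        std::apply( fn, packed );
    };
}

// Hypothetical work function used to demonstrate the packaging.
static int g_lastTicks = 0;
void RunServerFrame( int numTicks ) { g_lastTicks = numTicks; }
```

The point of call stays close to an ordinary call: `PackageCall( RunServerFrame, 3 )` instead of `RunServerFrame( 3 )`. Nothing runs until the returned object is invoked.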

Tailored tools: Game Threading Infrastructure

- One-off push to another core

    if ( !isEngineThreaded() )
        _Host_RunFrame_Server( numticks );
    else
        ThreadExecute( _Host_RunFrame_Server, numticks );
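
For readers without the Source SDK, the same inline-or-threaded choice can be approximated with `std::async` from the standard library. This is only an analogy: `RunFrame` and `SimulateServerFrame` are hypothetical, and `std::async` may start a new thread per call, whereas the slide's `ThreadExecute` is described as part of a thread pool.

```cpp
#include <cassert>
#include <future>

// Hypothetical stand-in for the server frame; returns a value so the
// result can be checked.
int SimulateServerFrame( int numTicks )
{
    return numTicks * 2;
}

// Runs the frame inline, or pushes it to another thread when 'threaded' is set.
int RunFrame( bool threaded, int numTicks )
{
    assert( numTicks >= 0 );
    if ( !threaded )
        return SimulateServerFrame( numTicks );

    std::future<int> result =
        std::async( std::launch::async, SimulateServerFrame, numTicks );
    // The caller is free to do other work here before collecting the result.
    return result.get();
}
```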

Tailored tools: Game Threading Infrastructure

- Parallel loop

    void ProcessPSystem( CParticleEffect *pEffect );

    ParallelProcess( particlesToSimulate.Base(),
                     particlesToSimulate.Count(),
                     ProcessPSystem );
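
A rough idea of what a parallel loop helper does internally is sketched below. `ParallelProcessSketch` is a hypothetical name, not the engine's `ParallelProcess`: it splits the array into contiguous chunks and spawns a fresh `std::thread` for each, where a production version would presumably hand the chunks to an existing thread pool.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Calls fn on every element of pItems[0..count), using up to nThreads threads.
template <typename T, typename Fn>
void ParallelProcessSketch( T *pItems, std::size_t count, Fn fn, unsigned nThreads = 4 )
{
    assert( pItems != nullptr || count == 0 );
    if ( nThreads == 0 )
        nThreads = 1;

    // Ceiling division so every element lands in exactly one chunk.
    std::size_t chunk = ( count + nThreads - 1 ) / nThreads;

    std::vector<std::thread> workers;
    for ( std::size_t begin = 0; begin < count; begin += chunk )
    {
        std::size_t end = std::min( begin + chunk, count );
        workers.emplace_back( [=] {
            for ( std::size_t i = begin; i < end; ++i )
                fn( pItems[i] );
        } );
    }
    for ( std::thread &w : workers )
        w.join();
}

// Hypothetical per-element work.
void DoubleValue( int &value ) { value *= 2; }
```

The helper is only safe when `fn` touches nothing but its own element, which is the "threads working on independent problems" condition from earlier in the talk.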

Tailored tools: Game Threading Infrastructure

- Queue up a bunch of work items, wait for them to complete

    BeginExecuteParallel();
    ExecuteParallel( g_pParticleSystem, &CParticleSystem::update, time );
    ExecuteParallel( &updateRopes, time );
    EndExecuteParallel();

- Low level APIs for the brave
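
The begin/queue/wait pattern can be imitated with standard futures, as in this sketch. All names here (`ParallelBatch`, `ParticleSystemSketch`, `UpdateRopesSketch`, `g_ropeTime`) are invented for illustration, and the sketch only accepts work items that return `void`.

```cpp
#include <atomic>
#include <cassert>
#include <future>
#include <vector>

// Collects work items and waits for all of them to finish.
class ParallelBatch
{
public:
    // Queues one work item. Accepts a free function, or a member function
    // pointer followed by the object to call it on.
    template <typename Fn, typename... Args>
    void Execute( Fn fn, Args... args )
    {
        m_pending.push_back( std::async( std::launch::async, fn, args... ) );
    }

    // Blocks until every queued item has completed.
    void End()
    {
        for ( std::future<void> &f : m_pending )
            f.get();
        m_pending.clear();
    }

private:
    std::vector<std::future<void>> m_pending;
};

// Hypothetical systems to update in parallel.
struct ParticleSystemSketch
{
    std::atomic<int> updated{ 0 };
    void Update( int time ) { updated += time; }
};

std::atomic<int> g_ropeTime{ 0 };
void UpdateRopesSketch( int time ) { g_ropeTime += time; }
```

Usage mirrors the slide: construct a `ParallelBatch`, call `Execute( &ParticleSystemSketch::Update, &particles, time )` and `Execute( UpdateRopesSketch, time )`, then call `End()` before reading any results.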

Contention

- What if you can’t eliminate contention over shared resources?
- Example: Allocator
- Heavily used
- Multiple pools of fixed sized blocks with a custom spin lock mutex per-pool
- Mutex limiting scale
- Didn’t want per-thread allocators

Contention

- Lock-free algorithms
- No thread can block system regardless of scheduling or state
- Under the hood of all services and data structures
- Relies on atomic write instructions, “compare-and-swap”

Contention

    // Conceptual behavior of compare-and-swap; Lock/Unlock are notional here,
    // the hardware performs the whole operation atomically.
    bool CompareAndSwap(int *pDest, int newvalue, int oldvalue)
    {
        Lock( pDest );
        bool success = false;
        if ( *pDest == oldvalue )
        {
            *pDest = newvalue;
            success = true;
        }
        Unlock( pDest );
        return success;
    }

Contention

    bool CompareAndSwap(int *pDest, int newvalue, int oldvalue)
    {
        _asm
        {
            mov eax,oldvalue        // eax = value we expect to find
            mov ecx,pDest
            mov edx,newvalue
            lock cmpxchg [ecx],edx  // if [ecx] == eax, store edx and set ZF
            mov eax,0
            setz al                 // result left in eax: 1 on success, 0 on failure
        }
    }
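
In portable C++11 and later, the same operation is available through `std::atomic`. The sketch below keeps the slide's argument order; the function names `CompareAndSwapSketch` and `AtomicAddSketch` are hypothetical, and the retry loop is included to show how compare-and-swap is typically used.

```cpp
#include <atomic>
#include <cassert>

// Same contract as the slide's CompareAndSwap, written with std::atomic.
bool CompareAndSwapSketch( std::atomic<int> *pDest, int newValue, int oldValue )
{
    assert( pDest != nullptr );
    // compare_exchange_strong overwrites its first argument with the current
    // value on failure, so pass a local copy instead of the caller's value.
    int expected = oldValue;
    return pDest->compare_exchange_strong( expected, newValue );
}

// Typical retry loop: add to a shared value without taking a lock.
// Returns the value that this call stored.
int AtomicAddSketch( std::atomic<int> *pDest, int amount )
{
    int oldValue = pDest->load();
    while ( !CompareAndSwapSketch( pDest, oldValue + amount, oldValue ) )
        oldValue = pDest->load();  // someone else changed it; re-read and retry
    return oldValue + amount;
}
```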

Contention

- Use lock-free algorithm in allocator
- Replace mutex and traditional free list per-pool with a lock-free list per-pool
- Windows API/XDK SList

Lock-free example: singly linked list

- Compare-and-swap
- “If head is equal to what I think it is, assign with my new head”
- ABA Problem: is it the same head?
- Use a serial number as a discriminating field

Lock-free example: singly linked list

    class CSList
    {
    public:
        CSList();
        void Push( SListNode_t *pNode );
        SListNode_t *Pop();
        SListNode_t *Detach();
        int Count() const;
    private:
        SListHead_t m_Head;
    };

Lock-free example: singly linked list

    struct SListNode_t
    {
        SListNode_t *pNext;
    };

    // Pointer, depth, and sequence number overlay one 64-bit value so the
    // whole head can be updated with a single 64-bit compare-and-swap.
    union SListHead_t
    {
        struct value_t
        {
            SListNode_t *pNext;
            int16 iDepth;
            int16 iSequence;
        } value;
        int64 value64;
    };

Lock-free example: singly linked list

    void Push( SListNode_t *pNode )
    {
        SListHead_t oldHead, newHead;
        for (;;)
        {
            // Snapshot the head, build the new head, then try to publish it.
            oldHead.value64 = m_Head.value64;
            newHead.value.iDepth = oldHead.value.iDepth + 1;
            newHead.value.iSequence = oldHead.value.iSequence + 1;
            newHead.value.pNext = pNode;
            pNode->pNext = oldHead.value.pNext;
            if ( ThreadInterlockedAssignIf64( &m_Head.value64,
                     newHead.value64, oldHead.value64 ) )
            {
                return;
            }
            // Head changed underneath us; loop and retry.
        }
    }
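
The slides show `Push` but not `Pop`, which is where the sequence number earns its keep. The sketch below is a simplified, portable variant and not the CSList code: instead of a 32-bit pointer it stores a 32-bit node index next to a 32-bit sequence number in one 64-bit atomic. `IndexStack`, `kEmpty`, and `kCapacity` are hypothetical names, and callers are assumed to pass indices below `kCapacity` that are not already on the stack.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Lock-free stack of node indices. The head packs {sequence, index} into
// 64 bits, so a stale head fails the compare-and-swap even if the same
// index has been popped and pushed again in the meantime (the ABA case).
class IndexStack
{
public:
    static constexpr std::uint32_t kEmpty = 0xFFFFFFFFu;
    static constexpr std::uint32_t kCapacity = 64;

    IndexStack() : m_head( Pack( kEmpty, 0 ) ) {}

    void Push( std::uint32_t index )
    {
        assert( index < kCapacity );
        std::uint64_t oldHead = m_head.load();
        for ( ;; )
        {
            m_next[index].store( Index( oldHead ) );
            std::uint64_t newHead = Pack( index, Sequence( oldHead ) + 1 );
            // On failure oldHead is refreshed with the current head.
            if ( m_head.compare_exchange_weak( oldHead, newHead ) )
                return;
        }
    }

    // Returns kEmpty when the stack has no entries.
    std::uint32_t Pop()
    {
        std::uint64_t oldHead = m_head.load();
        for ( ;; )
        {
            std::uint32_t index = Index( oldHead );
            if ( index == kEmpty )
                return kEmpty;
            // This read may be stale if another thread popped 'index' first;
            // the sequence check in the CAS below rejects that case.
            std::uint64_t newHead = Pack( m_next[index].load(), Sequence( oldHead ) + 1 );
            if ( m_head.compare_exchange_weak( oldHead, newHead ) )
                return index;
        }
    }

private:
    static std::uint64_t Pack( std::uint32_t index, std::uint32_t seq )
    {
        return ( static_cast<std::uint64_t>( seq ) << 32 ) | index;
    }
    static std::uint32_t Index( std::uint64_t head ) { return static_cast<std::uint32_t>( head & 0xFFFFFFFFu ); }
    static std::uint32_t Sequence( std::uint64_t head ) { return static_cast<std::uint32_t>( head >> 32 ); }

    std::atomic<std::uint64_t> m_head;
    std::atomic<std::uint32_t> m_next[kCapacity];
};
```

Using indices into a fixed array avoids the question of when a popped node's memory may be freed, which a pointer-based version has to answer separately.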

Lock-free example: singly linked list

- Lock-free list exceptionally useful
- Keep pools of context structures when impractical to give every thread a context
- Efficiently gather results of a parallel process for later handling
- Build up lists of data to operate on using Push(), then use Detach() (a.k.a. “Flush”) to grab the data in another thread in a single operation
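
The gather-then-detach usage can be sketched with a plain atomic pointer, since a list that only supports pushing and detaching everything does not run into the ABA problem. The names `ResultNode`, `ResultList`, and `GatherSum` are hypothetical; this is an illustration of the pattern, not the CSList implementation.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

struct ResultNode
{
    int value;
    ResultNode *pNext;
};

// Push-only lock-free list whose contents are taken all at once.
class ResultList
{
public:
    void Push( ResultNode *pNode )
    {
        assert( pNode != nullptr );
        ResultNode *pOld = m_pHead.load();
        do
        {
            pNode->pNext = pOld;
            // On failure pOld is refreshed with the current head and we retry.
        } while ( !m_pHead.compare_exchange_weak( pOld, pNode ) );
    }

    // Takes the entire chain in one atomic exchange, leaving the list empty.
    ResultNode *Detach()
    {
        return m_pHead.exchange( nullptr );
    }

private:
    std::atomic<ResultNode *> m_pHead{ nullptr };
};

// Several threads push results; the caller then detaches and sums them.
int GatherSum( int nThreads, int perThread )
{
    std::vector<ResultNode> nodes( nThreads * perThread );
    ResultList list;
    std::vector<std::thread> threads;
    for ( int t = 0; t < nThreads; ++t )
        threads.emplace_back( [&nodes, &list, t, perThread] {
            for ( int i = 0; i < perThread; ++i )
            {
                ResultNode *pNode = &nodes[t * perThread + i];
                pNode->value = 1;
                list.Push( pNode );
            }
        } );
    for ( std::thread &th : threads )
        th.join();

    int sum = 0;
    for ( ResultNode *p = list.Detach(); p != nullptr; p = p->pNext )
        sum += p->value;
    return sum;
}
```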

Example

    extern Vector trace_start;
    extern Vector trace_end;
    // etc...

    struct cbrush_t
    {
        int             contents;
        unsigned short  numsides;
        unsigned short  firstbrushside;
        int             checkcount; // to avoid repeated testings
    };

    ///////////////////////////////

    void BeginTrace()
    {
        g_CModelMutex.Lock();
        ++s_nCheckCount;
    }

Example

    struct TraceInfo_t
    {
        Vector m_start;
        Vector m_end;
        // etc...
        CVisitBitvec m_Brushvisits;
    };

    CTraceInfoPool g_TraceInfoPool;

    TraceInfo_t *BeginTrace()
    {
        TraceInfo_t *pTraceInfo;
        if ( !g_TraceInfoPool.PopItem( &pTraceInfo ) )
            pTraceInfo = new TraceInfo_t;
        return pTraceInfo;
    }

Lock-free algorithms

- Thread pool work distribution queue
- Derived from HL2 asynchronous I/O queue
- Designed for one provider, one consumer
- Simple prioritized queue with mutex
- Arbitrary priority
- One queue for all threads

Lock-free algorithms

- Solutions
- Use lock-free queue (Fober et al.)
- Rework interface to fixed priorities, one queue per-priority
- Interfaces critical
- Queues per core in addition to a shared queue
- Use atomic operations to get “ticket”, actual work done may differ

Lock-free algorithms

- Locks permit a stable reality
- Lock-free permits reality to change instruction to instruction
- Leverage inference rather than locks to know part of the system is stable
- Wait-free is always better

Looking Forward

- Why so much up-front investment?
- Steam
  - Communicate with customers
  - Tap markets not available via retail
- Dramatic change is underway
  - Core count doubles every 18 months
  - CPU/GPU/PPU/AIPU/etc. not the future
  - Many homogeneous cores
  - Division of computing power a software problem

Call to action

- Build or acquire strong tools, new techniques
- Embrace lock-free mechanisms to move work and data to and from wait-free code
- Prepare for decomposition of features over many cores
- Use accessible solutions to empower all programmers, not just systems programmers
- Support even higher level threading framed in terms of game problems

Summary

- Started with a stable but bad threading
- Iteratively eliminated bad cases using variety of techniques, usually lock-free
- During iterations, expanded toolset to meet newly discovered needs
- Focused on ease-of-use for other programmers
- Now being applied by others at higher levels

- In Source SDK this summer
- Contact: tom_gdc@valvesoftware.com